Edinburgh Research Explorer

Interpreting linear support vector machine models with heat map molecule coloring

Author: A Bender
Andreas Jahn
Andreas Zell
B Schölkopf
C Steinbeck
D Bossemeyer
D Fourches
D Rogers
D Weininger
G Hinselmann
Georg Hinselmann
H Kubinyi
I Guyon
J Bajorath
J Kazius
J Mohr
J Orts
K Hasegawa
KD Freeman-Cook
KH Bleicher
L Han
L Prade
L Ralaivola
Lars Rosenbaum
MS Buchanan
N Fechner
P Jonathan
RE Fan
SG Rohrer
SJ Swamidass
SM Free
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Model-based virtual screening plays an important role in the early drug discovery stage. The outcomes of high-throughput screenings are a valuable source for machine learning algorithms to infer such models. Besides strong performance, the interpretability of a machine learning model is a desired property to guide the optimization of a compound in later drug discovery stages. Linear support vector machines have been shown to perform convincingly on large-scale data sets. The goal of this study is to present a heat map molecule coloring technique to interpret linear support vector machine models. Based on the weights of a linear model, the visualization approach colors each atom and bond of a compound according to its importance for activity. Results: We evaluated our approach on a toxicity data set, a chromosome aberration data set, and the maximum unbiased validation data sets. The experiments show that our method sensibly visualizes structure-property and structure-activity relationships of a linear support vector machine model. The coloring of ligands in the binding pocket of several crystal structures of a maximum unbiased validation data set target indicates that our approach helps to determine the correct ligand orientation in the binding pocket. Additionally, the heat map coloring enables the identification of substructures important for the binding of an inhibitor. Conclusions: In combination with heat map coloring, linear support vector machine models can help to guide the modification of a compound in later stages of drug discovery. In particular, substructures identified as important by our method might be a starting point for optimization of a lead compound. The heat map coloring should be considered complementary to structure-based modeling approaches. As such, it helps to get a better understanding of the binding mode of an inhibitor.
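To make the coloring idea in this abstract concrete, the sketch below sums the weights of a linear model over the substructure features that cover each atom and maps each atom's total to a color. It is a minimal illustration under stated assumptions, not the authors' implementation: the feature-to-atom mapping, the function names `atom_scores` and `score_to_rgb`, the toy weights, and the green/red color scale are all hypothetical.

```python
from collections import defaultdict


def atom_scores(weights, feature_atoms):
    """Sum linear-model weights over the atoms each feature covers.

    weights:       dict feature id -> model weight (assumed input format)
    feature_atoms: dict feature id -> atom indices that the feature covers
                   in one molecule (assumed input format)
    """
    scores = defaultdict(float)
    for feature, atoms in feature_atoms.items():
        w = weights.get(feature, 0.0)  # features unknown to the model add nothing
        for atom in atoms:
            scores[atom] += w
    return dict(scores)


def score_to_rgb(score, max_abs):
    """Map a score to RGB: positive -> green, negative -> red, zero -> white."""
    if max_abs == 0:
        return (255, 255, 255)
    t = max(-1.0, min(1.0, score / max_abs))  # clamp to [-1, 1]
    fade = int(round(255 * (1 - abs(t))))
    return (fade, 255, fade) if t >= 0 else (255, fade, fade)


# Toy molecule with three atoms and three hypothetical features.
weights = {"f1": 0.8, "f2": -0.5, "f3": 0.1}
feature_atoms = {"f1": [0, 1], "f2": [1, 2], "f3": [2]}

scores = atom_scores(weights, feature_atoms)
max_abs = max(abs(s) for s in scores.values())
colors = {atom: score_to_rgb(s, max_abs) for atom, s in scores.items()}
```

Normalizing by the largest absolute score within one molecule is only one possible design choice; a scale shared across a whole data set would make colors comparable between compounds.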

Springer - Publisher Connector

Discovering collectively informative descriptors from high-throughput experiments

Author: A Bhattacharjee
A Golbraikh
A Hess
A Sadanandam
A Tropsha
AJ Sutton
AN Pronin
B Millauer
B Singh
BA Jensen
CA Powell
Clark D Jeffries
CM Findley
DA Smirnov
DG Beer
Diana O Perkins
DR Cox
DR Rhodes
EP Kopantzev
Fred A Wright
GC Chang
HS Soifer
J Kazius
J Lamb
J Zar
JS Nam
L Cronbach
L He
M Blangiardo
M Selbach
P Greenwood
P Westfall
R Breitling
R Edgar
R Li
RS Stearman
T Barrett
William O Ward
YH Soung
Ø Langsrud
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Improvements in high-throughput technology and its increasing use have led to the generation of many highly complex datasets that often address similar biological questions. Combining information from these studies can increase the reliability and generalizability of results and also yield new insights that guide future research. Results: This paper describes a novel algorithm called BLANKET for symmetric analysis of two experiments that assess informativeness of descriptors. The experiments are required to be related only in that their descriptor sets intersect substantially and their definitions of case and control are consistent. From resulting lists of n descriptors ranked by informativeness, BLANKET determines shortlists of descriptors from each experiment, generally of different lengths p and q. For any pair of shortlists, four numbers are evident: the number of descriptors appearing in both shortlists, in only the first shortlist, in only the second shortlist, or in neither shortlist. From the associated contingency table, BLANKET computes Right Fisher Exact Test (RFET) values used as scores over a plane of possible pairs of shortlist lengths [1, 2]. BLANKET then chooses a pair or pairs with RFET score less than a threshold; the threshold depends upon n and shortlist length limits and represents a quality of intersection achieved by less than 5% of random lists. Conclusions: Researchers seek within a universe of descriptors some minimal subset that collectively and efficiently predicts experimental outcomes. Ideally, any smaller subset should be insufficient for reliable prediction and any larger subset should have little additional accuracy. As a method, BLANKET is easy to conceptualize and presents only moderate computational complexity. Many existing databases could be mined using BLANKET to suggest optimal sets of predictive descriptors.
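The scoring step described in this abstract can be sketched as a right-tailed Fisher exact test on the overlap of two shortlists, evaluated over a grid of shortlist lengths. This is a minimal sketch under stated assumptions, not the BLANKET implementation: the function names `right_fisher_exact` and `scan_shortlists`, the toy rankings, and the single shared length limit `max_len` are hypothetical, and the random-list threshold mentioned in the abstract is not reproduced here.

```python
from math import comb


def right_fisher_exact(n, p, q, k):
    """Right-tailed Fisher exact p-value: P(overlap >= k) for random lists.

    n: total number of descriptors
    p, q: lengths of the two shortlists
    k: observed number of descriptors appearing in both shortlists
    """
    total = comb(n, q)
    upper = min(p, q)
    # Hypergeometric tail: choose i of the p first-list items, q - i of the rest.
    tail = sum(comb(p, i) * comb(n - p, q - i) for i in range(k, upper + 1))
    return tail / total


def scan_shortlists(rank_a, rank_b, max_len):
    """Score every pair of shortlist lengths (p, q) up to max_len."""
    n = len(rank_a)
    scores = {}
    for p in range(1, max_len + 1):
        top_a = set(rank_a[:p])
        for q in range(1, max_len + 1):
            k = len(top_a & set(rank_b[:q]))
            scores[(p, q)] = right_fisher_exact(n, p, q, k)
    return scores


# Two toy rankings of the same eight descriptors, most informative first.
rank_a = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
rank_b = ["d2", "d1", "d3", "d8", "d7", "d6", "d5", "d4"]

scores = scan_shortlists(rank_a, rank_b, max_len=4)
best = min(scores, key=scores.get)  # pair of lengths with the smallest score
```

With these toy rankings the smallest score falls at the pair of lengths where the two shortlists contain exactly the same descriptors; a real analysis would still need the threshold calibrated against random lists, as the abstract describes.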

Springer - Publisher Connector

Carolina Digital Repository

Comparative study of classification algorithms using molecular descriptors in toxicological databases

Author: A. Amini
A. Richard
A. White
C. Hansch
C. Russom
D. Bahler
D. Pugazhenthi
H. Fang
H. Waterbeemd van de
I.H. Witten
J. Dearden
J. Graham
J. Kazius
L. Gold
O. Ivanciuc
R. Guha
R. Todeschini
S. Ekins
S.J. Barrett
W. Duch
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2009

The rational development of new drugs is a complex and expensive process, comprising several steps. Typically, it starts by screening databases of small organic molecules for chemical structures with the potential to bind to a target receptor and prioritizing the most promising ones. Only a few of these will be selected for biological evaluation and further refinement through chemical synthesis. Despite the knowledge accumulated by pharmaceutical companies that continually improve the process of finding new drugs, a myriad of factors affect the activity of putative candidate molecules in vivo, and the propensity for causing adverse and toxic effects is recognized as the major hurdle behind the current "target-rich, lead-poor" scenario. In this study we evaluate the use of several Machine Learning algorithms to find useful rules for the elucidation and prediction of toxicity using 1D and 2D molecular descriptors. The results indicate that: i) Machine Learning algorithms can effectively use 1D molecular descriptors to construct accurate and simple models; ii) extending the set of descriptors to include 2D descriptors improves the accuracy of the models

Repositório Aberto da Universidade do Porto

Using Sequence Similarity Networks for Visualization of Relationships Across Diverse Protein Superfamilies

Author: A Bateman
AC Howlett
AJ Enright
AT Adai
B Rost
B Xu
CL Huang
CS Goh
D Medini
DH Huson
DL Wheeler
EC Meng
G Manning
HM Holden
Holly J. Atkinson
I. King Jordan
J Bockaert
J Boudeau
J Dvorák
J Kazius
JH Morris
JM Young
John H. Morris
JP Huelsenbeck
K Palczewski
L Buck
L Song
M Ashburner
M Bashton
M Murakami
N Saitou
P Bhaumik
P Shannon
P Storz
Patricia C. Babbitt
PC Babbitt
R Wiese
RC Edgar
RD Finn
RS Hall
SC Pegg
SF Altschul
SG Rasmussen
T Frickey
T Warne
Thomas E. Ferrin
TT Nguyen
V Cherezov
VP Jaakola
W Li
Publication venue: Public Library of Science
Publication date: 01/01/2009

The dramatic increase in heterogeneous types of biological data—in particular, the abundance of new protein sequences—requires fast and user-friendly methods for organizing this information in a way that enables functional inference. The most widely used strategy to link sequence or structure to function, homology-based function prediction, relies on the fundamental assumption that sequence or structural similarity implies functional similarity. New tools that extend this approach are still urgently needed to associate sequence data with biological information in ways that accommodate the real complexity of the problem, while being accessible to experimental as well as computational biologists. To address this, we have examined the application of sequence similarity networks for visualizing functional trends across protein superfamilies from the context of sequence similarity. Using three large groups of homologous proteins of varying types of structural and functional diversity—GPCRs and kinases from humans, and the crotonase superfamily of enzymes—we show that overlaying networks with orthogonal information is a powerful approach for observing functional themes and revealing outliers. In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships

CiteSeerX

Public Library of Science (PLOS)

eScholarship - University of California

Site-Directed Mutations and the Polymorphic Variant Ala160Thr in the Human Thromboxane Receptor Uncover a Structural Role for Transmembrane Helix 4

Author: A Grunbeck
A Sali
A Stojanovic
AB Asenjo
AD Mumford
B Vroling
BT Kinsella
CD Funk
D Toledo
DL Farrens
EM Smyth
F Piscione
FT Khasawneh
GG Krivov
J Kazius
J Standfuss
J Upadhyaya
JA Ballesteros
JK Bowers
JM Johnston
John Hwa
K Tanaka
Karl-Wilhelm Koch
L Geng
M Arakawa
M Eilers
M Hirata
M Unoki
MA Hanson
MP Miller
N Guex
NS Palikhe
O Trott
P Chelikani
P Eastman
P Fontana
PC Ng
Prashen Chelikani
RA Laskowski
Raja Chakraborty
S Ahuja
Sai Prasad Pydi
Scott Gleim
SH Kim
Shyamala Dakshinamurti
SJ Hong
SO Smith
SP So
T Hirata
TF Leung
W Guo
W Liu
Z Wang
Publication venue: Public Library of Science
Publication date: 17/01/2012

The human thromboxane A2 receptor (TP) belongs to the prostanoid subfamily of Class A GPCRs; on binding to thromboxane (TXA2), it mediates vasoconstriction and promotes thrombosis. In Class A GPCRs, transmembrane (TM) helix 4 appears to be a hot spot for non-synonymous single nucleotide polymorphic (nsSNP) variants. Interestingly, A160T is a novel nsSNP variant with unknown structure and function. Additionally, within this helix in TP, Ala160(4.53) is highly conserved, as is Gly164(4.57). Here we target Ala160(4.53) and Gly164(4.57) in the TP for detailed structure-function analysis. Amino acid replacements with smaller residues, A160S and G164A mutants, were tolerated, while bulkier beta-branched replacements, A160T and A160V, showed a significant decrease in receptor expression (Bmax). The nsSNP variant A160T displayed significant agonist-independent activity (constitutive activity). Guided by molecular modeling, a series of compensatory mutations were made on TM3 in order to accommodate the bulkier replacements on TM4. The A160V/F115A double mutant showed a moderate increase in expression level compared to either the A160V or F115A single mutant. Thermal activity assays showed a decrease in receptor stability in the order wild type > A160S > A160V > A160T > G164A, with G164A being the least stable. Our study reveals that Ala160(4.53) and Gly164(4.57) in the TP play critical structural roles in packing of the TM3 and TM4 helices. Naturally occurring mutations in conjunction with site-directed replacements can serve as powerful tools in assessing the importance of regional helix-helix interactions

Public Library of Science (PLOS)

FigShare

Open Babel: An open chemical toolbox

Author: A Amini
A Andronico
A Bender
A Gakh
A Karwath
A Maunz
A Poater
A Rappe
AA Gakh
AD Hill
B-b Yan
BD McKay
C Helma
C Reynès
Chris Morley
CR Jacob
Craig A James
CW Bullock
D Filimonov
D Lagorce
D Weininger
DC Bas
DC Lonie
DR Koes
F Fontaine
Geoffrey R Hutchison
GL Holliday
HL Morgan
I Wallach
IV Filippov
IV Tetko
J Ahmed
J Kazius
J Myers
J Wang
JH Chen
JJ Langham
JL Melville
JL Sharman
K Fogel
K Martin
L Fabian
L Liu
L Schietgat
M Brüstle
M Buehler
M Dehmer
M Konyk
M Krier
M Kuhn
MA Meineke
MA Miteva
Michael Banck
MJ Gómez
N O'Boyle
N Zonta
NM O'Boyle
Noel M O'Boyle
O Sperandio
P Lind
P Murray-Rust
P Rydberg
P Tosco
R Esposito
RA Bauer
RS Armen
S Arbor
S Ingsriswang
SV Trepalin
T Cheng
T Halgren
T Kogej
T Pencheva
Tim Vandermeersch
TWH Backman
U Schmidt
VV Mihaleva
William H Green
X Jiang
X Wang
YD Paila
Z Huang
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: A frequent problem in computational modeling is the interconversion of chemical structures between different formats. While standard interchange formats exist (for example, Chemical Markup Language) and de facto standards have arisen (for example, SMILES format), the need to interconvert formats is a continuing problem due to the multitude of different application areas for chemistry data, differences in the data stored by different formats (0D versus 3D, for example), and competition between software along with a lack of vendor-neutral formats. Results: We discuss, for the first time, Open Babel, an open-source chemical toolbox that speaks the many languages of chemical data. Open Babel version 2.3 interconverts over 110 formats. The need to represent such a wide variety of chemical and molecular data requires a library that implements a wide range of cheminformatics algorithms, from partial charge assignment and aromaticity detection, to bond order perception and canonicalization. We detail the implementation of Open Babel, describe key advances in the 2.3 release, and outline a variety of uses both in terms of software products and scientific research, including applications far beyond simple format interconversion. Conclusions: Open Babel presents a solution to the proliferation of multiple chemical file formats. In addition, it provides a variety of useful utilities from conformer searching and 2D depiction, to filtering, batch conversion, and substructure and similarity searching. For developers, it can be used as a programming library to handle chemical data in areas such as organic chemistry, drug design, materials science, and computational chemistry. It is freely available under an open-source license fro

CiteSeerX

Springer - Publisher Connector

Irish Universities